Abstract Rating scales are used to elicit data about qualitative entities (e.g., 
research collaboration). This study presents an innovative method for reducing 
the number of rating scale items without loss of predictability. The method is 
based on the area under the receiver operating characteristic curve (AUC ROC). 
It reduced the number of rating scale items (variables) to 28.57% of the original 
(from 21 to 6), making over 70% of the collected data unnecessary. 

Results have been verified by two methods of analysis: Graded Response 
Model (GRM) and Confirmatory Factor Analysis (CFA). GRM revealed that 
the new method differentiates observations with high and middle scores. CFA 
indicated that the reliability of the rating scale was not impaired by the item 
reduction. Both statistical analyses provided evidence of the usefulness of the 
AUC ROC reduction method. 

Keywords Rating scale • Prediction • Receiver operator characteristic • 
Reduction 

Mathematics Subject Classification (2000) 94A50 • 62C25 • 62C99 • 
62P10 


1 Introduction 

Rating scales (also called assessment scales) are used to elicit data about 
qualitative entities (e.g., research collaboration as in [?]). Often, the 
predictability of rating scales could be improved. Rating scales often use values 
from 1 to 10, and some rating scales have over 100 items (questions) to rate. 
Other popular terms for rating scales are “survey” and “questionnaire”, although 
a questionnaire is a method of data collection, while a survey is not necessarily 
conducted by questionnaire. Some surveys are conducted by interviews or by 
analyzing web pages. Rating itself is very popular on the Internet for “Customer 
Reviews”, which often use five stars (e.g., Amazon.com) instead of ordinal 
numbers. One may regard such a rating as a one-item rating scale. Surveys are 
used in [10] on Fig. 1 (with the caption: “Sketch of data integration in use for 
different purposes with interference points for standardisation”) as one of the 
main sources of data. 

A survey, based on a questionnaire answered by 1,704 researchers from 
86 different countries, was conducted by the Scientometrics study [8] on the 
impact factor, which is regarded as a controversial metric. Rating scales were 
also used in [23] and [18]. In [?], a different type of rating scale improvement 
was used (based on pairwise comparisons). Evidence that pairwise comparisons 
improve accuracy is given in [?] and [20]. 

According to [22]: 

... the differentiation of sciences can be explained in a large part by 
the diffusion of generic instruments created by research-technologists 
moving in interstitial arenas between higher education, industry, statistics 
institutes or the military. We have applied this analysis to research 
on depression by making the hypothesis that psychiatric rating scales 
could have played a similar role in the development of this scientific 
field. 

The absence of a well-established unit (e.g., one kilogram or meter) for 
measuring science compels us to use rating scales. They have great application 
to scientometrics for measuring and analyzing performance based 
on subjective assessments. Even granting academic degrees is based on rating 
scales (in this case, several exams, which are often given to students as 
questionnaires). Evidently, we regard this rating scale as accurate; otherwise, our 
academic degrees would not have much value. 

The importance of subjectivity processing was driven by the idea of bounded 
rationality, proposed by Herbert A. Simon (a Nobel Prize winner) as an 
alternative basis for the mathematical modelling of decision making. 


2 The data model 

Data collected by a rating scale with a fixed number of items (questions) are 
stored in a table with one decision (in our case, binary) variable. The parametrized 
classifier is usually created from the total score of all items. The outcome of such 
rating scales is usually compared to external validation provided by assessing 
professionals (e.g., grant application committees). 

Our approach not only reduces the number of items but also sequences 
them according to the contribution to predictability. It is based on the Receiver 
Operator Characteristic (ROC) which gives individual scores for all examined 
items. 


2.1 Predictability measures 

The term “receiver operating characteristic” (ROC), or “ROC curve” was 
coined for a graphical plot illustrating the performance of radar operators 
(hence “operating”). A binary classifier represented absence or presence of 
an enemy aircraft and was used to plot the fraction of true positives out of 
the total actual positives (TPR = true positive rate) vs. the fraction of false 
positives out of the total actual negatives (FPR = false positive rate). Positive 
instances (P) and negative instances (N) for some condition are counted and 
stored as four outcomes in a 2 × 2 contingency table (confusion matrix), as follows: 


Table 1 The confusion matrix 

True Positives     False Positives 
False Negatives    True Negatives 
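
In terms of the four cells of Table 1, the two rates plotted on the ROC curve can be written as follows. The abbreviations TP, FP, FN, and TN are shorthand introduced here for the cell counts and do not appear in the original text:

```latex
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad
\mathrm{FPR} = \frac{FP}{FP + TN}
```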


In assessment and evaluation research, the ROC curve is a representation 
of a “separator” (or decision) variable. The decision variable is usually “has 
a property” or “does not have a property”, or has some condition to meet 
(pass/fail). The frequencies of positive and negative cases of the diagnostic 
test vary with the “cut-off” value for positivity. By changing the “cut-off” 
value from 0 (all negatives) to a maximum value (all positives), we obtain the 
ROC by plotting TPR (true positive rate, also called sensitivity) versus FPR 
(false positive rate, equal to 1 − specificity) across varying cut-offs, which 
generates a curve in the unit square called an ROC curve. 

According to [?], the area under the curve (the AUC or AUROC) is equal 
to the probability that a classifier will rank a randomly chosen positive instance 
higher than a randomly chosen negative one (assuming that “positive” ranks 
higher than “negative”). 
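
The probabilistic reading of the AUC can be checked directly on a small example. The following is a minimal sketch in base R; the function name `auc_pairwise` and the scores and labels are made-up illustrative values, not the study data:

```r
# AUC as the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case.
auc_pairwise <- function(score, label) {
  pos <- score[label == 1]
  neg <- score[label == 0]
  # Compare every positive with every negative; ties count as one half.
  cmp <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
  mean(cmp)
}

# Hypothetical total scores and binary outcomes (illustration only).
score <- c(3, 5, 2, 8, 7, 1, 6, 4)
label <- c(0, 1, 0, 1, 1, 0, 0, 1)

auc_value <- auc_pairwise(score, label)
print(auc_value)  # 14 of the 16 positive/negative pairs are ordered correctly: 0.875
```

Counting correctly ordered pairs in this way is also what links the AUC to the Mann-Whitney U statistic.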

AUC is closely related to the Mann-Whitney U test which tests whether 
positives are ranked higher than negatives. It is also equivalent to the Wilcoxon 
test of ranks. The AUC is related to the Gini coefficient given by the formula 

\[ G_1 = 2 \cdot \mathrm{AUC} - 1, \tag{1} \]

where: 

\[ G_1 = 1 - \sum_{k=1}^{n} (X_k - X_{k-1})(Y_k + Y_{k-1}) \]

In this way, it is possible to compute the AUC using an average of a number 
of trapezoidal approximations. Practically all advanced statistics can be 
questioned, and they often gain recognition only after intensive use. The number of 
publications with ROC listed by PubMed.com has exploded in the last decade 
and reached 3,588 in 2013. An excellent tutorial-type introduction to ROC is 
in [?]. ROC was introduced during World War II for evaluating the performance 
of radar operators. Its first use in health-related sciences, according 
to a Medline search, is traced to [?]. 
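
The trapezoidal computation of the AUC can be sketched in a few lines of base R. The helper names `roc_points` and `auc_trapezoid` and the data are assumptions made for illustration; the study itself used the caTools package:

```r
# ROC points obtained by sweeping the cut-off over all observed scores.
roc_points <- function(score, label) {
  cuts <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(cuts, function(cut) mean(score[label == 1] >= cut))
  fpr <- sapply(cuts, function(cut) mean(score[label == 0] >= cut))
  # Start at (0, 0): a cut-off above every score labels all cases negative.
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))
}

# Area under the ROC points by the trapezoidal rule.
auc_trapezoid <- function(roc) {
  n <- nrow(roc)
  width <- diff(roc$fpr)
  mean_height <- (roc$tpr[-1] + roc$tpr[-n]) / 2
  sum(width * mean_height)
}

# Hypothetical total scores and binary outcomes (illustration only).
score <- c(3, 5, 2, 8, 7, 1, 6, 4)
label <- c(0, 1, 0, 1, 1, 0, 0, 1)

roc <- roc_points(score, label)
auc <- auc_trapezoid(roc)
gini <- 2 * auc - 1  # relation (1) between the Gini coefficient and the AUC
print(c(AUC = auc, Gini = gini))
```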


Table 2 AUC of individual variables in the original data 

Var   AUC         Var   AUC         Var   AUC 
21    0.587468    12    0.636791    17    0.674283 
11    0.597342    13    0.648917    15    0.692064 
16    0.605937    4     0.651187    10    0.697225 
6     0.610004    3     0.655666    9     0.700461 
18    0.610028    5     0.658478    7     0.701489 
19    0.629285    20    0.666999    14    0.707401 
2     0.631205    8     0.667983    1     0.725009 


2.2 Validation of the predictability improvement 

Supervised learning is the process of inferring a decision (a classification) 
from labeled training data. However, supervised learning may also employ 
other techniques, including statistical methods that summarize and explain 
key features of the data. For unsupervised learning, clustering is the most 
popular method for analyzing data. The k-means clustering optimizes well for 
the given number of classes. In our case, we have two classes: 0 for a “negative” 
and 1 for a “positive” outcome of the diagnosis for depression. 

The area under the receiver operating characteristic curve (AUC) reflects 
the relationship between sensitivity and specificity for a given scale. An ideal 
scale has an AUC score equal to 1, but this is not realistic in clinical practice. 
Cutoff values for positive and negative tests can influence specificity and 
sensitivity, but they do not affect the AUC. The AUC is widely recognized as the 
performance measure of a diagnostic test’s discriminatory power (see [?]). 
In our case, the input data have an AUC of 81.17%. 

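The following System R code was used to compute the AUC for all 21 
individual items: 

library(caTools) 

# read data from a csv file; column 1 is the binary decision variable, 
# columns 2 to 22 are the 21 rating scale items 
mydata = read.csv("C:\\BDI571.csv") 
y = mydata[,1] 

# one row per column of mydata; row 1 stays unused 
result <- matrix(nrow=22, ncol=2) 
ind = 2 
for (i in 2:22) 
{ 
  # colAUC returns one AUC per column of its first argument: 
  # the decision variable itself and item i 
  result[ind,] = colAUC(cbind(mydata[,1], mydata[,i]), y, 
                        plotROC=FALSE, alg="ROC") 
  ind = ind + 1 
} 

System R code 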

When AUC values are computed for all individual variables, we arrange 
them in descending order. The variables are shown in Table [3] in bold. 
The values in the row below them are the AUC of the running total up to the 
current variable. Evidently, the first value 0.725 is the same as in Table [2] since 
the running total is the single variable 1. However, the third value in the second 
row (0.795) is not for variable 7 alone but for the total of variables 1, 14, and 7. 
In particular, the last value (0.812) in Tab. [3] is for the total of all variables. 
Frankly, these numbers are very close to each other, but their line plot (Fig. [1]) 
demonstrates the usefulness of the method. The curve peak is at the sixth variable, 
which is variable 15. There is a slight decline until variable 16. 
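
The procedure described above (rank items by individual AUC, accumulate them into a running total, and stop at the peak) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the data are synthetic, and the sample size, item names, and effect sizes are hypothetical.

```r
# Sketch of item reduction by running-total AUC on synthetic data.
set.seed(1)
n <- 400
y <- rbinom(n, 1, 0.3)  # hypothetical binary decision variable (1 = positive)

# Five hypothetical items scored 0..3; V1-V3 depend on y, V4-V5 are noise.
items <- cbind(
  V1 = pmin(3, rpois(n, 0.6 + 1.2 * y)),
  V2 = pmin(3, rpois(n, 0.8 + 0.8 * y)),
  V3 = pmin(3, rpois(n, 1.0 + 0.4 * y)),
  V4 = pmin(3, rpois(n, 1.0)),
  V5 = pmin(3, rpois(n, 1.0))
)

# AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks.
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Step 1: AUC of every individual item, sorted in descending order.
item_auc <- apply(items, 2, auc, label = y)
ord <- order(item_auc, decreasing = TRUE)

# Step 2: AUC of the running total of the first k items in that order.
running_auc <- sapply(seq_along(ord), function(k) {
  auc(rowSums(items[, ord[1:k], drop = FALSE]), y)
})

# Step 3: keep the items up to the peak of the running-total curve.
best_k <- which.max(running_auc)
selected <- colnames(items)[ord[1:best_k]]
print(round(running_auc, 3))
print(selected)
```

With the real data, `items` would hold the 21 rating scale items and `y` the diagnostic outcome.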


Table 3 AUC of running variable totals 

Var   1       14      7       9       10      15      17 
AUC   0.725   0.777   0.795   0.810   0.813   0.822   0.821 

Var   8       20      5       3       4       13      12 
AUC   0.819   0.820   0.821   0.821   0.821   0.821   0.820 

Var   2       19      18      6       16      11      21 
AUC   0.819   0.818   0.816   0.814   0.812   0.811   0.812 


3 Relating the results to Graded Response Model 

Let us examine how our results can be related to the Graded Response Model 
(GRM). GRM is the counterpart of Item Response Theory (well addressed by a 
Wikipedia article) for ordinal, not binary, data. GRM is usually 
applied to establish the usefulness of test items [?]. 

GRM is used in psychometric scales to determine the level of three 
characteristics of each item, namely: a) item’s difficulty, b) item’s discriminant 
power, and c) item’s guessing factor. 

Item’s difficulty describes how difficult or easy it is for individuals to 
answer the item. A high positive value means that the item is very difficult; 
a high negative value means that the item is very easy. 

Item’s discriminant power describes the ability of a specific item to 
distinguish between upper- and lower-ability individuals on a test. 

Item’s guessing factor describes the probability that an individual with a low 
level of the feature (low depression) achieves a high score on this item. 

The aim of our analysis was to establish whether or not the GRM indicates 
the same items as the proposed method based on AUC. Two GRM models were 
built for the given rating scale: 

constrained (which assumes equal discrimination parameters across items), 
unconstrained (which assumes unequal discrimination parameters across 
items). 

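The System R ltm package [24] was used in our analysis. Fig. [2] illustrates the 
System R code for the GRM models. 


library(ltm) 

# constrained GRM: equal discrimination parameters across items 
fit1 <- grm(BDI571, constrained = TRUE) 
fit1 

# unconstrained GRM: item-specific discrimination parameters 
fit2 <- grm(BDI571) 
fit2 

# likelihood ratio test of the constrained against the unconstrained model 
anova(fit1, fit2) 

Fig. 2 System R code for GRM models. 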


In order to check whether or not the unconstrained GRM provides a better 
fit than the constrained GRM, a likelihood ratio test was used. It revealed that 
the unconstrained GRM is preferable (fit 2 in Tab. [4]). The results of the 
likelihood ratio test are presented in Tab. [4]. 


Table 4 Likelihood Ratio for the full GRM model 

         AIC        BIC        log.Lik      LRT      df   p-value 
fit 1    25494.12   25772.35   -12683.06 
fit 2    25367.63   25732.81   -12599.81    166.49   20   p < 0.001 
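
The likelihood ratio statistic in Table 4 can be recomputed from the two reported log-likelihoods. The sketch below uses base R only; the variable names are chosen here for illustration, and the small difference from the tabulated 166.49 comes from the rounding of the log-likelihoods:

```r
# Likelihood ratio test from the log-likelihoods reported in Table 4.
loglik_constrained   <- -12683.06  # fit 1
loglik_unconstrained <- -12599.81  # fit 2
df <- 20                           # 21 free discriminations instead of 1 common one

lrt <- -2 * (loglik_constrained - loglik_unconstrained)
p_value <- pchisq(lrt, df = df, lower.tail = FALSE)

print(c(LRT = lrt, df = df))
print(p_value < 0.001)
```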


Tab. [5] shows the results of the unconstrained GRM model, including the 
discrimination power of each item. 

Items selected by AUC ROC are shown in Tab. [5] in bold. Evidently, they 
have large discrimination power (see the last column). All selected 
items discriminate between responses above the mean value (so on their basis 
we can discriminate between respondents with severe and moderate levels of 
depression). Discrimination power is a characteristic of the items in the scale. 
It measures how respondents differ in their answers to rating scale items. 
The larger the discrimination power of an item, the more useful the item is 
in the scale [1]. Items selected by the proposed (AUC ROC) method have good 
discrimination, as can be seen in Table 5 (for example, the value 1.799 means 
that item V1 has good discrimination power). 

All items of the given rating scale together provide a total information of 56.21 
for the latent trait (the latent variable, adolescent depression in school in our case). 


Table 5 Unconstrained GRM model results for the full rating scale and the item 
discrimination power 

       Extrmt1   Extrmt2   Extrmt3   Dscrmn 
V1      0.178     1.099     2.542    1.799 
V2      0.214     1.536     2.485    1.315 
V3     -0.094     1.306     3.077    1.528 
V4     -0.447     1.673     3.511    1.268 
V5     -0.709     1.903     2.717    1.440 
V6     -0.073     1.679     2.194    0.860 
V7     -0.410     0.970     2.170    1.459 
V8     -0.641     0.876     2.157    1.461 
V9      0.092     1.895     2.471    1.405 
V10    -0.183     0.834     1.665    1.023 
V11    -0.881     2.187     3.280    0.767 
V12    -0.242     1.282     2.221    1.271 
V13    -0.351     1.631     2.660    1.054 
V14    -0.038     0.918     2.353    1.951 
V15    -0.627     1.046     2.248    1.593 
V16    -2.364     0.634     2.388    0.764 
V17    -0.482     1.287     2.366    1.296 
V18    -1.685     0.847     2.024    0.902 
V19    -1.623     0.366     2.648    1.078 
V20    -0.575     1.227     2.066    1.643 
V21     1.271     2.240     3.531    0.870 


The Test Information Curve (see Tab. [6]) shows that the six selected items provide 
19.62 of the total information of 56.21 for the latent trait. The higher the items’ 
discrimination, the more information (precision) the scale provides. 

The GRM model indicates different items than our proposed method. AUC 
ROC is based on the true and false positive rates, while the GRM model 
is based on the maximum likelihood estimate. The proposed method has 
greater diagnostic power. Diagnostic power is the ability of the test to detect 
all subjects who have the characteristic measured by the test (in our 
case, depression). A test with the maximum diagnostic power would detect 
all subjects (suffering from depression). Unfortunately, most procedures for 
selecting rating scale items do not compare their solutions with a diagnostic 
criterion. That is why the proposed method is so useful for the selection of items 
in different measurement tools (examinations, tests, sociometric scales, 
psychometric scales, and many others). 

We used the GRM model here to show that even such a powerful method as 
GRM (used in psychometrics to indicate which items can discriminate between 
subjects) does not answer the question about the diagnostic accuracy 
of items. According to GRM, items V2 and V3 (Table 5) have considerable 
discriminant power, but the proposed method shows which items better 
discriminate between subjects on the basis of diagnostic criteria. 


Table 6 Test information curve. 

Total Information = 56.21 
Information in (-4, 4) = 52 (92.51%) 
Based on all the items 

Total Information = 19.62 
Information in (-4, 4) = 18.97 (96.65%) 
Based on items 1, 7, 9, 10, 14, 15 


4 Reduced scale psychometric properties 

Confirmatory Factor Analysis (CFA) [15] [5] was used to verify the structure of 
our results. CFA is a factor analysis whose purpose is to verify structural 
validity: whether items belong to their scales and what their factor loadings are. 
A factor loading measures the relation between an observed variable (item) and 
a latent feature (scale). The higher the factor loading, the stronger the relation 
and the greater the importance of the item in the scale. More specifically, CFA was 
used to determine whether: 

— items indicated by AUC form a coherent scale that exhibits good reliability, 

— the reliability of the rating scale has not deteriorated by the scale item 
reduction. 

Two CFA models were built. The first CFA model has all items and the 
second CFA model has a reduced number of items. Since the items of the scale 
have a categorical format, the robust estimator WLSMV (weighted least squares 
mean and variance adjusted, see [?]) was used, as it is designed for categorical 
scales. The robust estimator tolerates departures from normal distributions. The 
analysis was conducted in the “lavaan” package for R. 


library(lavaan) 

# reduced scale: the six items selected by the AUC ROC method 
HS.model <- 'depression =~ V1+V7+V9+V10+V14+V15' 
fit <- cfa(HS.model, data=BDI571, 
           ordered=c("V1","V7","V9","V10","V14","V15")) 
summary(fit, fit.measures=TRUE, standardized=TRUE) 

# full scale: all 21 items 
HS.model <- 'depression =~ V1+V2+V3+V4+V5+V6+V7+V8+V9+V10+V11+V12+V13+V14+V15+ 
                          V16+V17+V18+V19+V20+V21' 
fit <- cfa(HS.model, data=BDI571, 
           ordered=c("V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11", 
                     "V12","V13","V14","V15","V16","V17","V18","V19","V20","V21")) 
summary(fit, fit.measures=TRUE, standardized=TRUE) 

Fig. 3 System R code for CFA 


The model for the full rating scale is presented in Fig. [4]. Table [7] presents 
parameter estimates of the full rating scale. Loadings of those items that have 
been identified by the presented method as having the greatest predictive 
power are in bold in Tab. [7]. A model with a reduced number of items is in 
Fig. [5]. Tab. [8] presents parameter estimates for the reduced scale model. 

Fig. 4 CFA model for the rating scale with all items presented in AMOS graphics 


Table 7 Parameter estimates of the full rating scale 

parameters   standardized   non-standardized   standard error 
λ_V1         0.716          1.000 
λ_V2         0.604          0.844              0.048 
λ_V3         0.655          0.914              0.047 
λ_V4         0.585          0.818              0.049 
λ_V5         0.633          0.884              0.049 
λ_V6         0.454          0.634              0.052 
λ_V7         0.636          0.889              0.048 
λ_V8         0.645          0.901              0.045 
λ_V9         0.632          0.883              0.049 
λ_V10        0.523          0.731              0.053 
λ_V11        0.394          0.550              0.053 
λ_V12        0.594          0.830              0.048 
λ_V13        0.514          0.718              0.056 
λ_V14        0.737          1.029              0.049 
λ_V15        0.654          0.913              0.050 
λ_V16        0.413          0.578              0.055 
λ_V17        0.589          0.822              0.047 
λ_V18        0.468          0.653              0.053 
λ_V19        0.520          0.726              0.050 
λ_V20        0.681          0.952              0.047 
λ_V21        0.433          0.605              0.067 


For the purpose of checking whether the models have a good fit, we used 
two fit indices: CFI (comparative fit index) and RMSEA (root mean square 
error of approximation). According to [?], both CFA models have a good 
fit to the data, as illustrated by Tab. [9]. Values of the CFI statistic for both 
models exceeded the required level of 0.9. For both models, the values of the 
RMSEA statistic (lower than 0.08) indicate a good fit of the proposed new scale 
structure to the given data. 

Fig. 5 CFA Model with a reduced number of items presented in AMOS graphics 


Table 8 Parameter estimates for the reduced scale 

parameters   standardized   non-standardized   standard error 
λ_V1         0.748          1.000 
λ_V7         0.614          0.821              0.054 
λ_V9         0.703          0.940              0.059 
λ_V10        0.534          0.714              0.061 
λ_V14        0.736          0.984              0.059 
λ_V15        0.816          0.816              0.057 


Table 9 Results of fit statistics for two rating scale models 

Statistic   Full rating scale   Reduced rating scale 
Chi2        437.899             30.883 
df          189                 9 
CFI         .950                .983 
RMSEA       .048                .065 
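
As a plausibility check, the RMSEA values in Table 9 can be approximated from the reported chi-square statistics and degrees of freedom. The sketch below uses the common textbook formula and assumes a sample size of N = 571; that number is only inferred from the data set name BDI571 and is not stated in the text, and lavaan's robust WLSMV computation may differ in detail:

```r
# RMSEA from chi-square, degrees of freedom, and sample size.
# Textbook formula; N is an assumption inferred from the name "BDI571".
rmsea <- function(chi2, df, n) {
  sqrt(max(chi2 - df, 0) / (df * (n - 1)))
}

n_assumed <- 571

rmsea_full    <- rmsea(437.899, 189, n_assumed)  # full rating scale
rmsea_reduced <- rmsea(30.883, 9, n_assumed)     # reduced rating scale

print(round(c(full = rmsea_full, reduced = rmsea_reduced), 3))
```

Under this assumption both values agree with Table 9 to three decimal places and stay below the 0.08 threshold.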


For both CFA models, construct reliability (CR) and variance extracted 
(VE) were computed. CR was computed by the formula (given in [15]): 

\[ CR = \frac{\left(\sum_{i=1}^{n} \lambda_i\right)^2}{\left(\sum_{i=1}^{n} \lambda_i\right)^2 + \sum_{i=1}^{n} \delta_i} \tag{2} \]

where: 

n is the total number of items, 

λ_i is the factor loading of item i, 

δ_i is the error variance of item i, which is the amount of variability 
unexplained by the items in the scale. 

The formula for computing variance extracted (VE) is based on [15]: 

\[ VE = \frac{\sum_{i=1}^{n} \lambda_i^2}{n} \tag{3} \]

where: 

λ_i is the factor loading of item i, 
n is the number of rating scale items. 
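
Formulas (2) and (3) can be evaluated with a few lines of base R. The loadings below are hypothetical values chosen for illustration, and the error variance is taken as 1 minus the squared standardized loading, which is a common convention assumed here rather than something the text states:

```r
# Construct reliability (CR) and variance extracted (VE)
# from standardized factor loadings.
lambda <- c(0.75, 0.60, 0.70, 0.55, 0.72, 0.61)  # hypothetical loadings
delta  <- 1 - lambda^2                           # assumed error variances

cr <- sum(lambda)^2 / (sum(lambda)^2 + sum(delta))  # formula (2)
ve <- sum(lambda^2) / length(lambda)                # formula (3)

print(round(c(CR = cr, VE = ve), 3))
```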


Table 10 Results of CR and VE of two models 

Rating scale        CR      VE 
Full rating scale   0.929   0.394 
Reduced scale       0.822   0.483 


The results revealed that the reliability of the reduced model (CR = .822) 
is lower than the reliability of the full model (CR = .929) by about 0.1. 
Nevertheless, the reliability of the reduced scale remains above the acceptability 
level. Removing 15 items has not impaired its reliability, as Tab. [10] demonstrates. 

For the reduced model, VE=.438, while for the full model, VE=.394. 
Evidently, the new model has VE closer to the criterion of .500. The reduced 
rating scale model has a better VE than the full rating scale model. It means 
that the reduced rating scale model explains the diversity of the results better 
than the full rating scale model (see Tab. [10]). 

On the basis of factor loadings (λ), we are unable to determine which 
items have the most predictive power. Items V3 and V20 have some of the highest 
factor loadings in the full rating scale, yet they do not have the most 
predictive power. Therefore, it is impossible to rank the rating scale items 
by factor analysis, but it is possible with 
the proposed method and with GRM. However, GRM cannot compare its solution 
with a diagnostic criterion, while the proposed method can. 


5 Discussion 

The Beck Depression Inventory (BDI) was selected for our study since it is one 
of the best known and most widely used self-rating scales to assess the presence 
and severity of depressive symptoms [7, 12, 16]. Our data were collected in high 
schools [?]. However, it needs to be stressed that our method is applicable 
to practically all rating scales. 

In summary, both models fit the data well. Both of them have good 
reliability and a relatively good variance extracted. Reducing the number of items 
did not impair psychometric properties, but simplified the whole structure (as 
indicated by the smaller number of degrees of freedom). According to 
Occam’s razor, the simpler the model, the better. Although it was not the 
main objective of this study, it is worth noting that the six selected rating scale 
items have better predictive power in our study than all 21 items. We have also 
demonstrated that our results have domain (semantic) meaning. 


6 Conclusions 

The presented method has reduced the number of rating scale items (variables) 
to 28.57% of the original (from 21 to 6), making over 70% of the collected data 
unnecessary. This is not only an essential budgetary saving, as data collection is 
usually expensive, but it often also contributes to reducing data collection errors. 
The more data are collected, the more errors are expected to occur. With 
the proposed AUC ROC reduction method, the predictability has increased 
by approximately 0.5%. It may seem insignificant, but for a large population 
it is not. In fact, [?] states that: “Taken together, mental, neurological 
and substance use disorders exact a high toll, accounting for 13% of the total 
global burden.” 

The proposed use of AUC for reducing the number of rating scale items 
is innovative and applicable to practically all rating scales. System R code is 
posted on the Internet for general use. A package for System R is under 
development. Certainly, more validation cases would be helpful, and assistance 
will be provided to anyone who wishes to try this method on their own 
data. 


Supporting Information 

The source code will be deposited at the SourceForge.net hosting provider (see 
[25]). According to [25], SourceForge “creates powerful software in over 400,000 
open source projects and hosts over 3.7 million registered users”. It connects 
well over 40 million customers with more than 4,800,000 downloads a day. 